Nunavut
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data
Ji, Shaoxiong, Li, Zihao, Paavola, Jaakko, Luo, Hengyu, Tiedemann, Jörg
This paper investigates a critical design decision in the practice of massively multilingual continual pre-training -- the inclusion of parallel data. Specifically, we study the impact of bilingual translation data for massively multilingual language adaptation of the Llama 3 family of models to 500 languages. To this end, we construct the MaLA bilingual translation corpus, containing data from more than 2,500 language pairs. Subsequently, we develop the EMMA-500 Llama 3 suite of four massively multilingual models -- continually pre-trained from the Llama 3 family of base models extensively on diverse data mixes up to 671B tokens -- and explore the effect of continual pre-training with or without bilingual translation data. Comprehensive evaluation across 7 tasks and 12 benchmarks demonstrates that bilingual data tends to enhance language transfer and performance, particularly for low-resource languages. We open-source the MaLA corpus, EMMA-500 Llama 3 suite artefacts, code, and model generations.
Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Yadav, Saumitra, Shrivastava, Manish
Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach does not guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups (50K, 100K, and 500K sentence pairs, respectively). We validated this trend across six additional language pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate that a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provide optimal results, particularly benefiting low-resource MT.
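To make the symmetric versus asymmetric distinction concrete, here is a minimal sketch of a toy BPE learner that is trained separately on each side with a different number of merge operations. It is not the authors' pipeline: the function names `learn_bpe`, `merge_pair`, and `segment`, the toy corpora, and the merge counts (8 and 2) are all illustrative assumptions, far smaller than the 4K to 32K and 0.5K to 2K ranges quoted in the abstract.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of whitespace-tokenized sentences."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically so runs are reproducible.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply learned merge rules, in order, to a single word."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy placeholder corpora (assumptions for illustration, not real parallel data).
src_corpus = ["low lower lowest", "new newer newest"]
tgt_corpus = ["naya naye nayi", "kam kamtar"]

# Asymmetric setup: more merges on the source side, fewer on the target side.
src_merges = learn_bpe(src_corpus, num_merges=8)
tgt_merges = learn_bpe(tgt_corpus, num_merges=2)

print(segment("lowest", src_merges))
print(segment("nayi", tgt_merges))
```

A symmetric setup would pass the same `num_merges` to both calls. With fewer merges the target side stays closer to character level, which is the direction the abstract reports as helpful in low-resource settings.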
- North America > United States (0.25)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (13 more...)
Rhinos once lived in Canada
A newly discovered species of Arctic rhino lived 23 million years ago. About 23 million years ago, a rhinoceros stomped across the Canadian High Arctic. A team of scientists from the Canadian Museum of Nature (CMN) has now found a new species of the enigmatic, now-extinct "Arctic rhino." First uncovered almost 40 years ago in lake deposits in Haughton Crater on Devon Island, Nunavut, the animal was more petite than many of its modern descendants.
- North America > Canada > Nunavut (0.25)
- Europe (0.07)
- South America (0.05)
- (4 more...)
Learning Coupled Earth System Dynamics with GraphDOP
Boucher, Eulalie, Alexe, Mihai, Lean, Peter, Pinnington, Ewan, Lang, Simon, Laloyaux, Patrick, Zampieri, Lorenzo, de Rosnay, Patricia, Bormann, Niels, McNally, Anthony
Interactions between different components of the Earth System (e.g. ocean, atmosphere, land and cryosphere) are a crucial driver of global weather patterns. Modern Numerical Weather Prediction (NWP) systems typically run separate models of the different components, explicitly coupled across their interfaces to additionally model exchanges between the different components. Accurately representing these coupled interactions remains a major scientific and technical challenge of weather forecasting. GraphDOP is a graph-based machine learning model that learns to forecast weather directly from raw satellite and in-situ observations, without reliance on reanalysis products or traditional physics-based NWP models. GraphDOP simultaneously embeds information from diverse observation sources spanning the full Earth system into a shared latent space. This enables predictions that implicitly capture cross-domain interactions in a single model without the need for any explicit coupling. Here we present a selection of case studies which illustrate the capability of GraphDOP to forecast events where coupled processes play a particularly key role. These include rapid sea-ice freezing in the Arctic, mixing-induced ocean surface cooling during Hurricane Ian and the severe European heat wave of 2022. The results suggest that learning directly from Earth System observations can successfully characterise and propagate cross-component interactions, offering a promising path towards physically consistent end-to-end data-driven Earth System prediction with a single model.
- North America > United States (0.68)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Atlantic Ocean > North Atlantic Ocean > Baffin Bay (0.04)
- (12 more...)
- Government > Regional Government > North America Government > United States Government (0.47)
- Energy (0.46)
Interpretable Time Series Autoregression for Periodicity Quantification
Chen, Xinyu, Digalakis, Vassilis Jr, Ding, Lijun, Zhuang, Dingyi, Zhao, Jinhua
Time series autoregression (AR) is a classical tool for modeling auto-correlations and periodic structures in real-world systems. We revisit this model from an interpretable machine learning perspective by introducing sparse autoregression (SAR), where $\ell_0$-norm constraints are used to isolate dominant periodicities. We formulate exact mixed-integer optimization (MIO) approaches for both stationary and non-stationary settings and introduce two scalable extensions: a decision variable pruning (DVP) strategy for temporally-varying SAR (TV-SAR), and a two-stage optimization scheme for spatially- and temporally-varying SAR (STV-SAR). These models enable scalable inference on real-world spatiotemporal datasets. We validate our framework on large-scale mobility and climate time series. On NYC ridesharing data, TV-SAR reveals interpretable daily and weekly cycles as well as long-term shifts due to COVID-19. On climate datasets, STV-SAR uncovers the evolving spatial structure of temperature and precipitation seasonality across four decades in North America and detects global sea surface temperature dynamics, including El Niño. Together, our results demonstrate the interpretability, flexibility, and scalability of sparse autoregression for periodicity quantification in complex time series.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (14 more...)
Narwhals spotted using tusks for non-mating fun
With their long, spiral tusks, narwhals (Monodon monoceros) look like something out of a fairy tale. Primarily seen in male narwhals, these single elongated teeth can grow up to 10 feet long. These gregarious whales typically travel in pods of two to 10 individuals, but are a bit elusive and difficult to study in the wild. Scientists believe that the tusks are primarily used in competition for mates, but that might not be the whole story. New drone evidence detailed in a study published February 28 in the journal Frontiers in Marine Science found that narwhals can use their tusks to forage, explore their surroundings, and even play.
- North America > United States > North Dakota > McKenzie County (0.05)
- North America > Canada > Nunavut (0.05)
- Research Report > New Finding (0.52)
- Personal (0.36)
COPU: Conformal Prediction for Uncertainty Quantification in Natural Language Generation
Wang, Sean, Jiang, Yicheng, Tang, Yuxin, Cheng, Lu, Chen, Hanjie
Uncertainty Quantification (UQ) for Natural Language Generation (NLG) is crucial for assessing the performance of Large Language Models (LLMs), as it reveals confidence in predictions, identifies failure modes, and gauges output reliability. Conformal Prediction (CP), a model-agnostic method that generates prediction sets with a specified error rate, has been adopted for UQ in classification tasks, where the size of the prediction set indicates the model's uncertainty. However, when adapting CP to NLG, the sampling-based method for generating candidate outputs cannot guarantee the inclusion of the ground truth, limiting its applicability across a wide range of error rates. To address this, we propose COPU, a method that explicitly adds the ground truth to the candidate outputs and uses logit scores to measure nonconformity. Our experiments with six LLMs on four NLG tasks show that COPU outperforms baseline methods in calibrating error rates and empirical cover rates, offering accurate UQ across a wide range of user-specified error rates.
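As background for the classification setting the abstract mentions, the following is a minimal sketch of standard split conformal prediction, not of COPU itself. It assumes a nonconformity score of one minus the predicted probability; the function names, calibration scores, and label probabilities are hypothetical values chosen for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Threshold on nonconformity scores from a held-out calibration set.

    Uses the finite-sample corrected rank ceil((n + 1) * (1 - alpha)).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this error rate: include everything.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose nonconformity (1 - probability) is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - p(true label) on seven held-out examples.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
threshold = conformal_threshold(cal_scores, alpha=0.25)

confident = prediction_set({"a": 0.70, "b": 0.25, "c": 0.05}, threshold)
uncertain = prediction_set({"a": 0.45, "b": 0.41, "c": 0.14}, threshold)
print(threshold, confident, uncertain)
```

The second input spreads its probability mass across two labels, so its prediction set is larger. That set size is the uncertainty signal the abstract refers to.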
- Europe > Austria > Vienna (0.14)
- Europe > France (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (21 more...)
- Leisure & Entertainment (1.00)
- Government (0.68)
- Automobiles & Trucks > Manufacturer (0.46)
Meta and UNESCO team up to improve translation AI
Meta has partnered with UNESCO on a new plan to improve translation and speech recognition AI, TechCrunch reported. As part of its Language Technology Partner Program, Meta is seeking collaborators willing to donate at least 10 hours of speech recordings with transcriptions, large written texts (200-plus sentences) and sets of translated sentences. The aim is to focus on "underserved languages, in support of UNESCO's work," Meta wrote in a blog post. So far, Meta and UNESCO have signed on the government of Nunavut, a northern Canadian territory. The aim is to develop translation systems for the Inuit languages used there, Inuktitut and Inuinnaqtun.
- North America > Canada > Nunavut (0.28)
- North America > United States (0.08)